========================================================
This report explores a dataset containing attributes of approximately 4,900 white wine samples.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ S.No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
##       S.No      fixed.acidity    volatile.acidity   citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800    Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100    1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600    Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782    Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200    3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000    Max.   :1.6600
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200   Median :0.04300   Median : 34.00
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##     alcohol         quality
##  Min.   : 8.00   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.40   Median :6.000
##  Mean   :10.51   Mean   :5.878
##  3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :14.20   Max.   :9.000
The dataset contains 4,898 observations of 13 variables, one of which (S.No) is a row index.
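The structure and summary output above can be reproduced with `dim()`, `str()` and `summary()`. The following is a minimal sketch rather than the original code: the name `wine_head` is an assumption, and the data frame holds only the first ten rows of six columns, copied from the `str()` output in this report.

```r
# First ten rows of six columns, copied from the str() output in this report.
# `wine_head` is a hypothetical name; the full data has 4898 rows.
wine_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality          = rep(6L, 10)
)

dim(wine_head)      # number of rows and columns
str(wine_head)      # type and first values of each column
summary(wine_head)  # six-number summary per column
```

Run on the full data frame, the same three calls should give output of the form shown above.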
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Most of the white wine observations have a quality rating between 5 and 7. The median is 6 and the mean is 5.878.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Although the count of observations peaks at an alcohol content of 9.5 percent, most of the observations have an alcohol percentage between 9 and 11. The median and mean alcohol content are 10.40 and 10.51 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH value is approximately normally distributed among the observations, and most of them lie between 2.9 and 3.4 on the pH scale. The median is 3.180 and the mean is 3.188.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Density is approximately normally distributed with very few outliers. Most of the wine observations lie between 0.991 and 0.996. The median is 0.9937 and the mean is 0.9940.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The first plot shows that fixed acidity is approximately normally distributed with a few outliers. To better understand the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a fixed acidity of around 6.8, and half of the samples lie between 6.3 and 7.3 (the first and third quartiles).
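Boxplots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the fixed acidity quartiles from the summary above. The helper name `tukey_fences` is an assumption, and R's `boxplot()` works from hinges, which can differ slightly from the `summary()` quartiles.

```r
# Hypothetical helper: fences of the conventional 1.5 * IQR outlier rule.
tukey_fences <- function(q1, q3, coef = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - coef * iqr, upper = q3 + coef * iqr)
}

# Quartiles of fixed acidity taken from the summary output in this report.
fences <- tukey_fences(q1 = 6.3, q3 = 7.3)
fences
```

Under this rule, fixed acidity values below about 4.8 or above about 8.8 would be drawn as individual outlier points.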
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The first plot shows that volatile acidity is approximately normally distributed with a few outliers. To better understand the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a volatile acidity of around 0.28, and about 75% of the samples lie at or below 0.32.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid content is approximately normally distributed with outliers. To better understand the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a citric acid level of around 0.34, and about 75% of the samples lie at or below 0.39.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The distribution of residual sugar is right skewed, with more than 75% of white wine samples having a residual sugar content below 10 and a few outliers far above that. To better understand the distribution, a boxplot was also drawn. The highest count of wine samples is at a residual sugar content of around 6.
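One quick, informal check for right skew is to compare the mean with the median: a long right tail pulls the mean upward. The sketch below uses the medians and means printed in this report. The column name `rel_gap` is an assumption, and the ratio is only a heuristic, not a formal skewness statistic.

```r
# Medians and means copied from the summary output in this report.
centres <- data.frame(
  variable = c("residual.sugar", "alcohol", "pH", "density"),
  median   = c(5.2, 10.40, 3.180, 0.9937),
  mean     = c(6.391, 10.51, 3.188, 0.9940)
)

# Relative gap between mean and median; larger positive values hint at
# a longer right tail.
centres$rel_gap <- (centres$mean - centres$median) / centres$median
centres[order(-centres$rel_gap), ]
```

Residual sugar shows by far the largest gap of the four, which agrees with the right-skewed shape described above.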
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## [1] 1163
White wine samples have a very low chlorides content: about 75% of samples are at or below 0.05, and 1163 samples are above 0.05. There is wide variation between the minimum and maximum chlorides content, with a minimum of 0.009 and a maximum of 0.346. Due to the large number of outliers, the distribution of chlorides content is skewed far to the right. To better understand the distribution, the long-tailed data is transformed to a log10 scale. The transformed distribution looks approximately normal, with the highest count of samples at a chlorides content of around 0.045.
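The count above the 0.05 threshold and the log10 transformation can be sketched as follows. This is an illustration under assumptions: the vector holds only the first ten chlorides values printed by `str()` in this report, and the variable names are hypothetical.

```r
# First ten chlorides values as printed by str() in this report.
chlorides <- c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045, 0.045, 0.049, 0.044)

# Number of these ten samples above the 0.05 threshold.
n_above <- sum(chlorides > 0.05)

# Share of the full dataset above 0.05, using the counts from the report.
share_above <- 1163 / 4898

# log10 compresses the long right tail; the extremes come from the summary.
log_chlorides <- log10(chlorides)
log_range <- log10(c(min = 0.009, max = 0.346))

n_above
share_above
log_range
```

On the raw scale the maximum is roughly 38 times the minimum, while on the log10 scale the two are about 1.6 units apart, which is why the transformed histogram is easier to read.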
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The distribution of free sulfur dioxide is more or less normal with a few outliers. To better understand the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a free sulfur dioxide content of around 35, and 75% of the wine samples lie at or below 46.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The distribution of total sulfur dioxide is more or less normal with a few outliers. To better understand the distribution, a boxplot that includes the outliers was also drawn. The highest count of wine samples is at a total sulfur dioxide content of around 138, close to the mean of 138.4, and 75% of the wine samples lie at or below 167.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
75% of white wine samples have a sulphates level at or below 0.55, and the mean is 0.4898.
There are 4898 white wine samples with 12 attributes (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). Following are the observations from the dataset:
1) Most of the white wine observations have a quality score of 6.
2) Around 75% of the observations have a residual sugar content below 10, while the minimum and maximum values are 0.6 and 65.8 respectively.
3) The median and mean alcohol content are 10.40 and 10.51 respectively.
4) pH is approximately normally distributed among the observations, with a mean of 3.188.
5) There are 4 outliers with fixed acidity greater than 10.5 and 6 outliers with volatile acidity greater than 0.9.
The main features of interest in the dataset are quality, alcohol and residual sugar content. I believe pH and residual sugar, in combination with other attributes, can be used to build a predictive model for white wine quality.
Citric acid content, density and chlorides content may also contribute to the quality of the wine.
I did not create any new variables from the existing ones, as each already describes a distinct property of the wine.
I did not perform any operations on the data to tidy, adjust or change the form of the data.
## df$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## df$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## df$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## df$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## df$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## df$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## df$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
The above plots and summaries show that high quality wine samples have a lower chlorides content.
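Per-quality summaries in the `df$quality: 3` format printed above are what base R's `by()` produces when a numeric column is split by a grouping column. The sketch below shows the call shape on a small toy data frame. The values in `toy` are hypothetical and are not taken from the wine data.

```r
# Hypothetical toy data for illustration only, not from the wine dataset.
toy <- data.frame(
  quality   = c(5L, 5L, 5L, 7L, 7L, 7L),
  chlorides = c(0.050, 0.047, 0.053, 0.036, 0.031, 0.044)
)

# One summary() per quality level, printed in the same layout as above.
by(toy$chlorides, toy$quality, summary)

# The same grouping reduced to one mean per quality level.
group_means <- tapply(toy$chlorides, toy$quality, mean)
group_means
```

`tapply()` is convenient when only a single statistic per group is needed, for example to plot mean chlorides against quality.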
White wine samples with a high content of fixed acidity, volatile acidity, citric acid and residual sugar have a lower chlorides content. This might have an impact on the quality of the wine.
The above plot supports the assumption that quality decreases as the content of fixed acidity, volatile acidity, citric acid and residual sugar increases.
Density is related to residual sugar and alcohol. As the above plot shows, density increases as residual sugar increases and decreases as alcohol content increases.
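The direction of these two relationships can be checked with `cor()`. The sketch below is only a sanity check under strong assumptions: it uses just the first five rows printed by `str()` in this report (the only density values shown, and shown rounded), so the magnitudes say nothing about the full dataset. The name `first_rows` is hypothetical.

```r
# First five rows as printed by str() in this report; density is rounded there.
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Pearson correlation of density with each of the two variables.
cor_sugar   <- cor(first_rows$density, first_rows$residual.sugar)
cor_alcohol <- cor(first_rows$density, first_rows$alcohol)

c(density_vs_sugar = cor_sugar, density_vs_alcohol = cor_alcohol)
```

Even in this tiny sample the signs match the plot: positive for residual sugar and negative for alcohol. The same two `cor()` calls on the full data frame would give the actual strength of each relationship.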
As expected, the pH value decreases as fixed acidity increases.
## df$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.587 4.600 6.393 10.700 16.200
## --------------------------------------------------------
## df$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.300 2.500 4.628 7.100 17.550
## --------------------------------------------------------
## df$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
## --------------------------------------------------------
## df$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## df$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
## --------------------------------------------------------
## df$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.100 4.300 5.671 8.200 14.800
## --------------------------------------------------------
## df$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.60 2.00 2.20 4.12 4.20 10.60
There is a relation between quality and residual sugar. From a quality rating of 5 upward, mean residual sugar generally falls as quality increases, from 7.335 at quality 5 to 4.12 at quality 9, with a small rise at quality 8.
## df$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.035 3.215 3.188 3.325 3.550
## --------------------------------------------------------
## df$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.070 3.160 3.183 3.280 3.720
## --------------------------------------------------------
## df$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
## --------------------------------------------------------
## df$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## df$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
## --------------------------------------------------------
## df$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.120 3.230 3.219 3.330 3.590
## --------------------------------------------------------
## df$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.200 3.280 3.280 3.308 3.370 3.410
This plot shows that pH varies within each quality rating, with the widest range at a quality rating of 6. From a quality rating of 5 upward, the mean pH value increases with quality.
## df$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## df$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## df$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## df$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## df$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## df$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## df$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
The above plots clearly show that quality increases with alcohol content. The summary shows that from a quality rating of 5 upward, the mean alcohol content rises steadily with quality, from 9.809 at quality 5 to 12.18 at quality 9.
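The trends read off the per-quality summaries can be checked by differencing the group means. The sketch below copies the means from the `by()` output in this report into a data frame and computes the step-to-step change from quality 5 upward. The names `quality_means`, `from5` and `steps` are assumptions.

```r
# Mean of each variable per quality rating, copied from the by() output above.
quality_means <- data.frame(
  quality        = 3:9,
  chlorides      = c(0.05430, 0.0501, 0.05155, 0.04522, 0.03819, 0.03831, 0.0274),
  residual.sugar = c(6.393, 4.628, 7.335, 6.442, 5.186, 5.671, 4.12),
  pH             = c(3.188, 3.183, 3.169, 3.189, 3.214, 3.219, 3.308),
  alcohol        = c(10.35, 10.15, 9.809, 10.58, 11.37, 11.64, 12.18)
)

# Change in each mean between consecutive ratings, from quality 5 to 9.
from5 <- quality_means[quality_means$quality >= 5, ]
steps <- sapply(from5[-1], diff)
rownames(steps) <- paste(5:8, 6:9, sep = "->")
steps
```

Mean alcohol and mean pH rise at every step. Mean chlorides and mean residual sugar fall at every step except from quality 7 to 8, where both tick up slightly. The quality 9 group is small, so its means should be read with caution.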
There is a strong relation between alcohol content and quality. In line with my intuition, there is also a relation between residual sugar and quality. Chlorides content decreases as fixed acidity, volatile acidity, citric acid and residual sugar increase. These could also have an impact on the quality of the wine.
Density has a strong relationship with residual sugar and alcohol content. The pH value has a relationship with fixed acidity.
Alcohol content has a strong relationship with quality. Density also has a strong relationship with residual sugar and alcohol content. As expected, the pH value also has a relationship with fixed acidity.
Most of the high quality wines have a low chlorides content, which can be seen in the above plot.
Residual sugar content is lower in high quality wines. An increase in residual sugar content leads to an increase in density.
The above plot shows that alcohol content is high and chlorides content is low in high quality wine samples.
The above plot shows that chlorides content falls as fixed acidity, volatile acidity, citric acid and residual sugar content increase. High quality white wine samples also have a lower chlorides content.
The above plot shows that density increases with residual sugar content and decreases with alcohol content. Most of the high quality wine samples have low residual sugar content, high alcohol content and low density.
Here again the above plot shows that most of the high quality wine samples have low residual sugar content and high alcohol content.
The above plot shows that most of the high quality wine samples have low chlorides content and high alcohol content.
Alcohol content, residual sugar and chlorides content play an important role in determining the quality of wine, which can be seen from the above plots.
One of the interesting interactions is between density and both residual sugar and alcohol content.
## df$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400
## --------------------------------------------------------
## df$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0130 0.0380 0.0460 0.0501 0.0540 0.2900
## --------------------------------------------------------
## df$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
## --------------------------------------------------------
## df$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## df$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
## --------------------------------------------------------
## df$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100
## --------------------------------------------------------
## df$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0180 0.0210 0.0310 0.0274 0.0320 0.0350
The above plot shows that high quality wine samples have a lower chlorides content. The mean chlorides content decreases from a quality rating of 6 upward, apart from a negligible rise at quality 8.
The above plot shows that density increases with residual sugar content and decreases with alcohol content. Most of the high quality wine samples have low residual sugar content and high alcohol content.
The above plot shows that most of the high quality white wine samples have low chlorides content and high alcohol content.
The white wine data set contains information on almost 4,900 white wine samples across 12 attributes. I started by understanding the individual attributes in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of white wine samples across many attributes and created plots to help predict the quality of white wine.
The exploration of the white wine dataset shows that the quality of white wine is largely based on the alcohol, residual sugar and chlorides content. High quality white wine samples have high alcohol content and low residual sugar and chlorides content.
In future I would like to explore the acidity features of the white wine and how they affect quality.